feat: batch KV block copies via cudaMemcpyBatchAsync in fs connector by kfirtoledo · Pull Request #607 · llm-d/llm-d-kv-cache

kfirtoledo · 2026-05-26T10:08:59Z

Summary

Replace the per-(block, layer) cudaMemcpyAsync loop in TensorCopier with a single cudaMemcpyBatchAsync (CUDA 12.8+) submission. Submits all descriptors in one driver call, removing per-call dispatch overhead.

Enabled by default; toggle off via USE_BATCH_MEMCPY_READ=0 / USE_BATCH_MEMCPY_WRITE=0.
Per-call DMA loop kept as fallback for older CUDA toolkits and A/B debugging.
srcAccessOrder=ANY set on cudaMemcpyAttributes (matches vLLM's simple_kv_offload/cuda_mem_ops.py).
#if CUDA_VERSION handles the failIdx out-param that CUDA 13 dropped.
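
To make "all descriptors in one driver call" concrete, here is a minimal host-only sketch of how the per-(block, layer) descriptors could be gathered into parallel source, destination, and size arrays before a single batched submission. `TensorView`, `CopyBatch`, and `build_write_batch` are hypothetical names, and the layout (each layer holds its blocks contiguously, copied into one contiguous staging buffer) is an assumption for illustration, not the connector's actual data model.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical view of one layer's KV tensor: blocks are assumed contiguous.
struct TensorView {
    std::uint8_t* base;       // start of this layer's tensor
    std::size_t block_bytes;  // bytes per block in this layer
};

// Parallel arrays, one entry per (block, layer) copy.
struct CopyBatch {
    std::vector<const void*> srcs;
    std::vector<void*> dsts;
    std::vector<std::size_t> sizes;
};

// Build num_blocks * num_tensors descriptors that pack the selected blocks
// of every layer back to back into a staging buffer.
inline CopyBatch build_write_batch(const std::vector<TensorView>& tensors,
                                   const std::vector<std::size_t>& block_ids,
                                   std::uint8_t* staging) {
    CopyBatch batch;
    const std::size_t n = tensors.size() * block_ids.size();
    batch.srcs.reserve(n);
    batch.dsts.reserve(n);
    batch.sizes.reserve(n);

    std::size_t offset = 0;  // running write position in the staging buffer
    for (std::size_t block_id : block_ids) {
        for (const TensorView& t : tensors) {
            assert(t.base != nullptr);
            batch.srcs.push_back(t.base + block_id * t.block_bytes);
            batch.dsts.push_back(staging + offset);
            batch.sizes.push_back(t.block_bytes);
            offset += t.block_bytes;
        }
    }
    return batch;
}
```

The three arrays would then be handed to one batched copy call instead of looping over `cudaMemcpyAsync`. The call itself is left out here because its parameter list differs between CUDA 12.8 and CUDA 13, as noted above.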

Measured impact (128k tokens, TP=4, `--block-size 512`)

| Workload | no-BATCH | BATCH | Speedup |
|---|---|---|---|
| gpt-oss-20b (HMA) cold | 3.35s | 1.77s | 1.9x |
| gpt-oss-20b (HMA) hot | 0.58s | 0.30s | 1.9x |
| gpt-oss-120b (HMA) cold | 4.95s | 2.51s | 2.0x |
| gpt-oss-120b (HMA) hot | 0.83s | 0.34s | 2.4x |
| Llama-3.1-8B hot | 0.45s | 0.43s | neutral |
| Llama-3.1-70B hot | 0.84s | 0.85s | neutral |

Roughly 1.9x to 2.4x faster on HMA models, where per-layer DMAs are small; neutral on Llama (no HMA), where each per-call copy is already large enough to amortize driver dispatch.
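
The amortization argument can be sketched with a toy cost model. It assumes each submission pays a fixed dispatch cost and that transfer time scales linearly with bytes; `CopyCost` and its fields are hypothetical, any numbers plugged in are illustrative rather than measured, and a real batched call also has some per-descriptor cost that this model ignores.

```cpp
#include <cassert>
#include <cstddef>

// Toy model parameters (assumptions, not measurements).
struct CopyCost {
    double dispatch_us;   // fixed driver cost per submission
    double bytes_per_us;  // sustained copy bandwidth
};

// N separate submissions: every copy pays the dispatch cost.
inline double per_call_time_us(const CopyCost& c, std::size_t n, std::size_t bytes) {
    const double transfer = static_cast<double>(bytes) / c.bytes_per_us;
    return static_cast<double>(n) * (c.dispatch_us + transfer);
}

// One batched submission: the dispatch cost is paid once.
inline double batched_time_us(const CopyCost& c, std::size_t n, std::size_t bytes) {
    const double transfer = static_cast<double>(bytes) / c.bytes_per_us;
    return c.dispatch_us + static_cast<double>(n) * transfer;
}

// Ratio > 1 means the batched path is faster under this model.
inline double modeled_speedup(const CopyCost& c, std::size_t n, std::size_t bytes) {
    assert(n > 0 && c.bytes_per_us > 0.0);
    return per_call_time_us(c, n, bytes) / batched_time_us(c, n, bytes);
}
```

Under this model the ratio tends toward 1 as the per-copy size grows, which is consistent with the neutral Llama rows, and grows when copies are small relative to the dispatch cost, which is consistent with the HMA rows.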

Test plan

`make test` — 30 passed, 3 skipped (all on this branch, batched path active by default).
Manual storage roundtrip on Llama-3.1-8B / 70B + gpt-oss-20b/120b with batch enabled and disabled (see table).

Etelis

Add a fallback for CUDA_VERSION < 12080 (CUDA 12.8); otherwise LGTM.

Etelis · 2026-05-28T15:49:49Z

+// Batched DMA path: one cudaMemcpyBatchAsync covers all per-(block, layer)
+// copies for the blocks in this file (num_blocks * num_tensors).
+// The batch executes in stream order; ordering within the batch is unspecified.
+void TensorCopier::copy_blocks_via_batch_memcpy(


I'm not sure it's relevant, but HIP (AMD) also supports batch memcpy out of the box, and might be worth adding as well.

github-actions · 2026-06-02T11:09:16Z

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

Etelis

lgtm

Submit all per-(block, layer) copies in one driver call instead of N cudaMemcpyAsync calls. Enabled by default; toggle off with USE_BATCH_MEMCPY_READ=0 / USE_BATCH_MEMCPY_WRITE=0. Requires CUDA 12.8+. Speeds up KV-cache offload writes/reads when per-layer DMA sizes are small enough that driver dispatch dominates.
Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>

cudaMemcpyBatchAsync was introduced in CUDA 12.8, so guard the batch path with #if CUDA_VERSION >= 12080 and route to the per-call cudaMemcpyAsync loop on older toolkits. Default USE_BATCH_MEMCPY_* to off on older toolchains so the env knob still makes sense. Also drop thread_local on the attrs/attrs_idx inputs (never mutated, so no per-thread duplication is needed) and move the copy_blocks dispatcher below the helpers it dispatches to.
Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
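
A minimal sketch of the version guard and env default described in this commit message. The parsing rule (unset or empty means "use the default", `"0"` means off, anything else means on) and the helper names `env_flag` and `use_batch_memcpy` are assumptions for illustration; the connector's actual parsing may differ.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// CUDA_VERSION normally comes from <cuda.h>. Defining it as 0 here only lets
// this sketch compile on a host without the CUDA toolkit.
#ifndef CUDA_VERSION
#define CUDA_VERSION 0
#endif

// 12080 encodes CUDA 12.8 (major * 1000 + minor * 10).
#if CUDA_VERSION >= 12080
constexpr bool kBatchMemcpyAvailable = true;
#else
constexpr bool kBatchMemcpyAvailable = false;
#endif

// Assumed parsing: unset or empty -> default_value, "0" -> false, else true.
inline bool env_flag(const char* name, bool default_value) {
    assert(name != nullptr);
    const char* value = std::getenv(name);
    if (value == nullptr || *value == '\0') {
        return default_value;
    }
    return std::strcmp(value, "0") != 0;
}

// The batch path is on by default only when the API is compiled in, and is
// never selected on toolkits that lack it, whatever the env var says.
inline bool use_batch_memcpy(const char* env_name) {
    return kBatchMemcpyAvailable && env_flag(env_name, kBatchMemcpyAvailable);
}
```

A dispatcher could evaluate `use_batch_memcpy("USE_BATCH_MEMCPY_READ")` once and then pick between the batched helper and the per-call loop.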

dannyharnik

LGTM
/approve

kfirtoledo requested review from dannyharnik, liu-cong and vMaroon as code owners May 26, 2026 10:09

github-actions Bot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) May 26, 2026

github-actions Bot requested review from hyeongyun0916, sagearc and yankay May 26, 2026 10:09

kfirtoledo force-pushed the batch-memcpy branch from c7fd16e to 191f072 Compare May 26, 2026 10:17

github-actions Bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) and removed the size/XL label (500-999 lines) May 26, 2026

Etelis reviewed May 28, 2026

kfirtoledo force-pushed the batch-memcpy branch from 9f62073 to e97f213 Compare June 2, 2026 11:09

Etelis approved these changes Jun 3, 2026

kfirtoledo added 2 commits June 3, 2026 03:51

kfirtoledo force-pushed the batch-memcpy branch from e97f213 to dd176b7 Compare June 3, 2026 07:53

dannyharnik approved these changes Jun 4, 2026

kfirtoledo merged commit 59aa98f into llm-d:main Jun 4, 2026
11 checks passed

miroslavln mentioned this pull request Jun 10, 2026

fix/issue 656 default block size factor miroslavln/llm-d-kv-cache#1

Closed

kfirtoledo merged 2 commits into llm-d:main from kfirtoledo:batch-memcpy
